The genome sequences of the marine diatom Epithemia pelagica strain UHM3201 (Schvarcz, Stancheva & Steward, 2022) and its nitrogen-fixing, endosymbiotic cyanobacterium

We present the genome assembly of the pennate diatom Epithemia pelagica strain UHM3201 (Ochrophyta; Bacillariophyceae; Rhopalodiales; Rhopalodiaceae) and that of its cyanobacterial endosymbiont (Chroococcales: Aphanothecaceae). The genome sequence of the diatom is 60.3 megabases in span, and the cyanobacterial genome has a length of 2.48 megabases. Most of the diatom nuclear genome assembly is scaffolded into 15 chromosomal pseudomolecules. The organelle genomes have also been assembled, with the mitochondrial genome 40.08 kilobases and the plastid genome 130.75 kilobases in length. A number of other prokaryote MAGs were also assembled.


Background
Epithemia pelagica is a single-celled marine pennate diatom belonging to the family Rhopalodiaceae. Similar to other species within this family (Nakayama et al., 2011; Prechtl et al., 2004), E. pelagica hosts nitrogen-fixing cyanobacterial endosymbionts, which are thought to be in the early stages of becoming an organelle (Kneip et al., 2007; Nakayama et al., 2014). E. pelagica was first isolated from open ocean waters north of Hawai'i, in the North Pacific Ocean (Schvarcz et al., 2022). While this new diatom species has yet to be reported elsewhere, metagenomic analyses have detected gene sequences matching E. pelagica's symbiont in tropical and subtropical marine habitats across the globe, suggesting this symbiosis is more widespread than currently reported.
E. pelagica is characterised by small, solitary cells measuring 6-18 µm long and 5-10 µm wide. Cells are strongly dorsiventral and asymmetrical along the apical axis, and valves are lunate with rounded apices, having a convex dorsal margin and concave ventral margin. E. pelagica differs from other species in the genus Epithemia by its minute size, weakly silicified frustules with delicate costae, and very fine striae that are not resolvable with light microscopy. E. pelagica cells typically harbor one or two endosymbionts, but cell cultures can lose their symbionts when grown for extended periods in nitrogen-rich medium. The endosymbionts lack fluorescent photosynthetic pigments and tend to be located next to the host cell's nucleus.
The genome assemblies for E. pelagica and its endosymbiont will be a valuable resource for furthering our understanding of endosymbiosis and organellogenesis. These genomes will reveal the extent of endosymbiotic gene transfer to the diatom host and will guide future investigations of host-symbiont physiology, including the transfer of key metabolites. The genome of E. pelagica will also aid phylogenomic and evolutionary studies of the diatom order Rhopalodiales.

Genome sequence report
The genome of Epithemia pelagica strain UHM3201 was sequenced from cultured cells (Figure 1) isolated from seawater at Station ALOHA in the subtropical North Pacific Ocean (Schvarcz et al., 2022). A total of 298-fold coverage in Pacific Biosciences single-molecule HiFi long reads was generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected eight missing joins or mis-joins and removed one haplotypic duplication, reducing the scaffold number by 14.81%.
The final assembly has a total length of 60.3 megabases (Mb) in 21 sequence scaffolds with a scaffold N50 of 3.9 Mb (Table 1). The snail plot in Figure 2 provides a summary of the assembly statistics, while the distribution of assembly scaffolds on GC proportion and coverage is shown in Figure 3. The cumulative assembly plot in Figure 4 shows curves for subsets of scaffolds assigned to different phyla. Most (99.54%) of the assembly sequence was assigned to 15 chromosomal-level scaffolds. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 5; Table 2). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial and plastid genomes were also assembled (40.08 and 130.75 kilobases (kb) in size, respectively) and can be found as contigs within the multifasta file of the genome submission.
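For readers unfamiliar with the scaffold N50 statistic quoted above, the following is a minimal Python sketch of how such a value is conventionally computed from a list of scaffold lengths. It is illustrative only and is not the tool used to produce the figures in this note; the function name `nx` and the toy lengths are assumptions introduced here.

```python
from typing import Iterable


def nx(lengths: Iterable[int], fraction: float = 0.5) -> int:
    """Return the Nx length (N50 when fraction=0.5, N90 when fraction=0.9).

    Nx is the length of the shortest scaffold in the smallest set of
    longest scaffolds whose combined length reaches `fraction` of the
    total assembly span.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one scaffold length is required")
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


# Toy example (hypothetical lengths, not the E. pelagica scaffolds).
toy_lengths = [2, 10, 6, 8, 4]
n50 = nx(toy_lengths)        # scaffolds 10 + 8 reach half of the 30 bp span
n90 = nx(toy_lengths, 0.9)   # scaffolds 10 + 8 + 6 + 4 reach 90% of the span
```

Applied to the real scaffold lengths, the same calculation would be expected to give the N50 and N90 values reported in the snail plot, although the published values come from the authors' own tooling.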

Sample acquisition and nucleic acid extraction
A sample of Epithemia pelagica (specimen ID DU0000022, ToLID uoEpiScrs1) was obtained from cultured cells

The assembly of the endosymbiont uoEpiScrs1.Cyanobacterium_sp_1.1 was produced using the following pipeline. To identify cyanobacterial reads, BLAST was run with the PacBio HiFi reads of the uoEpiScrs1 sample against NCBI sequence NZ_AP018341.1 (cyanobacterium endosymbiont of Rhopalodia gibberula isolate RgSB Namiki Park) with settings "outfmt 6 -max_target_seqs 10 -max_hsps 1 -evalue 1e-25 -dust yes -lcase_masking". The tool seqkit 2.2.0 was used to isolate reads yielding BLAST hits. These reads were then assembled with Flye 2.9-b1768 with the following settings: --pacbio-hifi --meta --scaffold --keep-haplotypes. Prokka 1.14.6 was used to annotate the contigs. NCBI BLAST against the nt database was run with the 16S rRNA gene sequences from this bacterial assembly. The only circular contig in the output of the assembler (contig_39) was identified as cyanobacterial (top BLAST match: "Cyanobacterium endosymbiont of Rhopalodia gibberula DNA, isolate: RgSB"). BUSCO 5.2.2 with the bacteria_odb10 lineage was run with this contig. The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline.
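The read-selection step of this pipeline (keeping only reads that produced a BLAST hit) can be illustrated with a short Python sketch. This is a stand-in for explanation only: the authors used seqkit for this step, and the function names, the in-memory string inputs, and the toy records below are assumptions made here. The sketch relies on the default column order of BLAST tabular output (format 6), in which the query ID is column 1 and the e-value is column 11.

```python
from typing import Set


def hit_read_ids(blast_tabular: str, max_evalue: float = 1e-25) -> Set[str]:
    """Collect query (read) IDs from BLAST tabular output (format 6).

    Default format 6 columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    ids: Set[str] = set()
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows rather than guessing
        if float(fields[10]) <= max_evalue:
            ids.add(fields[0])
    return ids


def filter_fasta(fasta_text: str, keep_ids: Set[str]) -> str:
    """Return only the FASTA records whose ID (first header token) is kept."""
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in keep_ids
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")


# Toy inputs (hypothetical reads; only the subject accession is from the text).
blast_rows = (
    "read1\tNZ_AP018341.1\t99.1\t5000\t40\t5\t1\t5000\t100\t5100\t0.0\t9000\n"
    "read3\tNZ_AP018341.1\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-10\t150\n"
)
reads = ">read1 desc\nACGT\nGG\n>read2\nTTTT\n>read3\nCCCC\n"
selected = filter_fasta(reads, hit_read_ids(blast_rows))
```

In this toy case only `read1` passes the e-value threshold, so it is the only record retained for assembly. Because BLAST already applies the `-evalue` cutoff when it is run, the threshold check in the sketch is a safeguard rather than a required step.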

Wellcome Sanger Institute - Legal and Governance
The materials that have contributed to this genome note have been supplied by a Tree of Life collaborator. The Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible. The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material

The data note on genome sequencing of a marine diatom is well presented and soundly analyzed. Besides the genome description, the authors also provided the metagenomes using the "bin" strategy. As a data report, it contains no discussion or evolutionary perspective, which is expected. The only (very minor) flaw is that the metagenome results could also have provided some information on, for example, the core genes found. This information is also descriptive and should have been added to the metagenome description.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes

Figure 2. Genome assembly of Epithemia pelagica, uoEpiScrs1.2: metrics. The BlobToolKit snail plot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 60,520,547 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (6,983,076 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (3,856,736 and 3,186,670 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the stramenopiles_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/Epithemia%20pelagica/dataset/uoEpiScrs1_2/snail.

Figure 5. Genome assembly of Epithemia pelagica, uoEpiScrs1.2: Hi-C contact map of the uoEpiScrs1.2 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=ek4BBaV4Smikzb_1OOkKpQ.

Figure 7. Metagenome of Epithemia pelagica strain UHM3201. Blob plot of mapped base coverage against GC proportion for metaMDBG-assembled contigs. Contigs are coloured by assigned taxonomic class, where NA represents unbinned and eukaryotic sequences. Circles are sized in proportion to length on a square-root scale, ranging from 894 to 7,080,000. The assembly has been filtered to exclude records with mapped base coverages < 1. Histograms show the distribution of record length sum along each axis.

© 2024 Andrade S. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Sonia Andrade, Universidade de São Paulo, São Paulo, State of São Paulo, Brazil

Table 1. Genome data for Epithemia pelagica strain UHM3201, uoEpiScrs1.2.
Project accession data
(Narváez-Gómez et al., 2023) are adapted from column VGP-2020 of "Table 1: Proposed standards and metrics for defining genome assembly quality" from (Rhie et al., 2021).

(Figure 1) isolated from seawater at Station ALOHA in the subtropical North Pacific Ocean. The cells were collected, and the diatom species was identified by Christopher Schvarcz (University of Hawai'i at Mānoa), Rosalina Stancheva (George Mason University), and Grieg Steward (University of Hawai'i at Mānoa). Cell pellets were collected by centrifugation (4,000 × g for 10 min), followed by transfer to a cryovial, and then snap-frozen in liquid nitrogen. High molecular weight (HMW) DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute (WSI), following a sequence of core procedures: sample preparation; sample homogenisation; HMW DNA extraction; DNA fragmentation; and DNA clean-up. The uoEpiScrs1 sample was prepared on dry ice (Jay et al., 2023). The cells were cryogenically disrupted using the Covaris cryoPREP® Automated Dry Pulverizer (Narváez-Gómez et al., 2023). HMW DNA

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer Report 08 June 2024 https://doi.org/10.21956/wellcomeopenres.23805.r82798